Abstract

We introduce incremental variational inference and apply it to latent Dirichlet allocation (LDA). Incremental variational inference is inspired by incremental EM and provides an alternative to stochastic variational inference. Incremental LDA can process massive document collections, does not require setting a learning rate, converges faster to a local optimum of the variational bound and enjoys the attractive property of monotonically increasing it. We study the performance of incremental LDA on large benchmark data sets. We further introduce a stochastic approximation of incremental variational inference which extends to the asynchronous distributed setting. The resulting distributed algorithm achieves performance comparable to single-host incremental variational inference, but with a significant speed-up.


1. Introduction 


Approximate Bayesian inference has become mainstream in machine learning (Bishop et al., 2006; Murphy, 2012) and has enjoyed renewed interest in the statistics community (Wang & Titterington, 2006; Armagan & Dunson, 2011). It constitutes an appealing alternative to Markov Chain Monte Carlo when one is interested in probabilistic data modelling. Approximate inference techniques are pragmatic, postulating an approximate model family and trying to find the best model within this family by optimizing a surrogate objective (Wainwright & Jordan, 2008). They are also practical, as the code implementing these inference algorithms is relatively easy to debug. For example, variational inference monotonically increases the variational objective. Hence, the bound provides a sanity check for correctness and can be used to monitor convergence.


The amount of data being generated and collected today is tremendous. For example, at the time of writing, there are almost 5 million articles in Wikipedia. Amazon S3 holds trillions of objects and over 6 billion hours of video are watched each month on YouTube. In 2012, the number of active Facebook users surpassed 1 billion. The trend of "big data growth" presents enormous challenges for industry and creates a need to invent new algorithms capable of ingesting and processing massive data sets.


Stochastic variational inference (Hoffman et al., 2013) was a first step in this direction in the context of approximate inference. It relies on stochastic optimization (Robbins & Monro, 1951) and was designed to handle very large data sets by processing the data sequentially. The drawback of stochastic variational inference is that it requires adjusting additional parameters such as the learning rate and the mini-batch size. Moreover, it does not share the attractive property of batch variational inference of monotonically increasing the bound while inferring the model parameters.


The increasing availability of distributed architectures, such as multi-processor and grid-computing hardware, provides an opportunity to devise distributed inference algorithms able to take advantage of the infrastructure and perform well at scale. Recent attempts in this direction include the work by Smola & Narayanamurthy (2010); Newman et al. (2009); Asuncion et al. (2009). However, stochastic variational inference cannot easily be adapted to the distributed optimization setting.


To address the shortcomings of stochastic variational inference, we introduce incremental variational inference, which generalizes incremental EM proposed by Neal & Hinton (1998). Like stochastic variational inference, incremental variational inference processes the data sequentially. However, it does not require adjusting a learning rate. By maintaining a set of local statistics, it also preserves the property of monotonically increasing the variational objective at each iteration. We further propose a stochastic modification of incremental variational inference that can be executed in a distributed environment. We report significant horizontal speed-up, while sacrificing very little predictive performance.


In this paper, we focus on Latent Dirichlet Allocation (LDA) (Griffiths & Steyvers, 2004; Blei et al., 2003), a popular generative model for documents. However, it should be noted that the approximate inference scheme we introduce is general and is applicable to any latent variable model with a set of local and global variables.


Topic models like LDA make the simplifying assumption that documents can be represented as bags of words. This means they ignore the sequential structure of the text. More specifically, LDA postulates the existence of a collection of $K$ topics, each of which is defined as a categorical distribution over a vocabulary of size $V$. It further assumes that each document in a corpus of $D$ documents is generated according to a document-specific categorical distribution over these topics.

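Let us denote word $n$ in document $d$ by $x_{nd}$ and its topic assignment by $z_{nd}$. The generative model is defined as follows:

$$z_{nd} \mid \theta_d \sim \mathrm{Categorical}(\theta_d),$$
$$x_{nd} \mid z_{nd}, \{\phi_k\}_{k=1}^{K} \sim \mathrm{Categorical}(\phi_{z_{nd}}), \qquad (1)$$

where $\theta_d \sim \mathrm{Dirichlet}(\alpha_0 \mathbf{1}_K)$ and $\phi_k \sim \mathrm{Dirichlet}(\beta_0 \mathbf{1}_V)$. The parameters $\alpha_0$ and $\beta_0$ are non-negative reals.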
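The generative process in (1) can be illustrated with a short sampler. This is only a sketch using the Python standard library: the function names, the fixed document length and the default hyperparameter values are assumptions made for the example (the model above does not specify how document lengths are drawn).

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalising Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [g / total for g in draws]

def generate_corpus(num_docs, doc_length, num_topics, vocab_size,
                    alpha0=0.5, beta0=0.05, seed=0):
    """Sample topics phi_k, then words x_nd for each document as in (1)."""
    rng = random.Random(seed)
    # phi[k] is the categorical distribution of topic k over the vocabulary.
    phi = [sample_dirichlet(beta0, vocab_size, rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # theta is the document-specific distribution over topics.
        theta = sample_dirichlet(alpha0, num_topics, rng)
        doc = []
        for _ in range(doc_length):  # fixed length: an assumption of this sketch
            z = rng.choices(range(num_topics), weights=theta)[0]
            x = rng.choices(range(vocab_size), weights=phi[z])[0]
            doc.append(x)
        corpus.append(doc)
    return phi, corpus
```

Each document is returned as a list of word indices, which matches the bag-of-words view: the order of the indices carries no information for the model.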

The paper is organized as follows. In Section 2 we review batch and stochastic variational inference for LDA. In Section 3 we introduce incremental variational inference and its stochastic counterpart. The asynchronous distributed inference algorithm for LDA is described in Section 4. After discussing related work in Section 5, we present results on several large benchmark data sets in Section 6.


2. Variational Inference for LDA 


Bayesian inference is often difficult in practice as it requires the computation of analytically intractable integrals. One can circumvent this problem by resorting to Markov Chain Monte Carlo (MCMC) to simulate samples from the posterior. For example, collapsed Gibbs sampling has proven to be very successful for inference in LDA (Griffiths & Steyvers, 2004). However, convergence of MCMC is notoriously difficult to verify. A more pragmatic approach is to consider deterministic approximations like variational inference (Bishop et al., 2006) or expectation propagation (Minka & Lafferty, 2002). These methods turn the inference problem into an optimization problem, which is often easier to tackle and whose convergence is easier to monitor.


Variational inference maximizes a lower bound to the log marginal likelihood of the data by approximating the true posterior with a simpler distribution, which is parametrized by a set of free parameters. In the case of LDA, the variational bound is given by

$$\ln p(X) \geq \langle \ln p(X, Z, \Theta, \Phi) \rangle + H[q(Z, \Theta, \Phi)]$$
$$\phantom{\ln p(X)} = \ln p(X) - \mathrm{KL}[q(Z, \Theta, \Phi) \,\|\, p(Z, \Theta, \Phi \mid X)],$$

where $X = \{x_{nd}\}_{n,d}$, $Z = \{z_{nd}\}_{n,d}$, $\Theta = \{\theta_d\}_d$ and $\Phi = \{\phi_k\}_k$. The notation $\langle \cdot \rangle$ denotes an expectation with respect to $q(Z, \Theta, \Phi)$, $H[q]$ is the differential entropy and $\mathrm{KL}[q \| p]$ is the Kullback-Leibler divergence with respect to $q$. Maximizing this bound is equivalent to minimizing the Kullback-Leibler divergence between the true posterior $p(Z, \Theta, \Phi \mid X)$ and the approximate posterior $q(Z, \Theta, \Phi)$. In general, this minimization problem is still problematic, unless we further restrict the form of $q(Z, \Theta, \Phi)$.

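Mean field variational inference (MVI) assumes the latent variables and the parameters are independent when conditioning on the data, that is, $q(Z, \Theta, \Phi) = \prod_{n,d} q(z_{nd}) \times \prod_d q(\theta_d) \times \prod_k q(\phi_k)$. It can be shown that in this case the lower bound is maximised when the factors are defined as follows (Blei et al., 2003):

$$q(z_{nd}) = \mathrm{Categorical}(\pi_{nd}), \quad \pi_{knd} \propto \exp\{\langle \ln \theta_{kd} \rangle + \langle \ln \phi_{x_{nd} k} \rangle\},$$
$$q(\theta_d) = \mathrm{Dirichlet}(\alpha_d), \quad \alpha_{kd} = \alpha_0 + \langle m_{kd} \rangle,$$
$$q(\phi_k) = \mathrm{Dirichlet}(\beta_k), \quad \beta_{vk} = \beta_0 + \langle m_{vk} \rangle, \qquad (2)$$

where $m_{kd}$ is the (unobserved) number of times topic $k$ appeared in document $d$ and $m_{vk}$ the (unobserved) number of times word token $v$ was assigned to topic $k$ in the corpus. Hence, the quantities $\langle m_{kd} \rangle$ and $\langle m_{vk} \rangle$ are expected counts under the variational approximation. They are respectively given by $\sum_n \pi_{knd}$ and $\sum_{n,d} \delta_v(x_{nd}) \pi_{knd}$. The function $\delta_v(\cdot)$ is Dirac's delta centred at $v$. The expectations $\langle \ln \theta_{kd} \rangle$ and $\langle \ln \phi_{vk} \rangle$ are respectively given by $\psi(\alpha_{kd}) - \psi(\sum_k \alpha_{kd})$ and $\psi(\beta_{vk}) - \psi(\sum_v \beta_{vk})$.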
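The local updates in (2) for a single document can be sketched as follows. This is an illustrative implementation rather than the authors' code: the function names, the stopping rule (`max_iters`, `tol`) and the hand-written digamma approximation are assumptions of the example. The global parameters are passed as a nested list `beta[v][k]`.

```python
import math

def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x
    result -= inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def local_update(doc, beta, alpha0, max_iters=100, tol=1e-6):
    """Coordinate ascent on pi_nd and alpha_d for one document, following (2)."""
    num_topics = len(beta[0])
    col_sums = [sum(row[k] for row in beta) for k in range(num_topics)]
    # <ln phi_vk> = psi(beta_vk) - psi(sum_v beta_vk), only for words in the document.
    e_log_phi = {v: [digamma(beta[v][k]) - digamma(col_sums[k])
                     for k in range(num_topics)] for v in set(doc)}
    alpha = [alpha0] * num_topics
    pi = []
    for _ in range(max_iters):
        psi_sum = digamma(sum(alpha))
        e_log_theta = [digamma(a) - psi_sum for a in alpha]
        pi = []
        for v in doc:
            logits = [e_log_theta[k] + e_log_phi[v][k] for k in range(num_topics)]
            top = max(logits)  # subtract the maximum for numerical stability
            weights = [math.exp(l - top) for l in logits]
            norm = sum(weights)
            pi.append([w / norm for w in weights])
        # alpha_kd = alpha_0 + <m_kd>, with <m_kd> = sum_n pi_knd
        new_alpha = [alpha0 + sum(p[k] for p in pi) for k in range(num_topics)]
        change = max(abs(a - b) for a, b in zip(alpha, new_alpha))
        alpha = new_alpha
        if change < tol:
            break
    return pi, alpha
```

The returned `pi` holds one row of topic responsibilities per word of the document, so each row sums to one and `sum(alpha)` equals $K\alpha_0 + N_d$.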


MVI is a coordinate ascent method that converges to a local maximum of the variational bound (Beal, 2003). Cycling through the updates for the variational parameters in (2) ensures a monotonic increase of this bound. MVI is a batch inference approach: every update of the variational parameter $\beta_{vk}$ requires updating all word-specific proportions $\pi_{nd}$ beforehand, which is costly when the corpus is large. Stochastic variational inference (SVI) was recently proposed in the context of LDA to address this problem (Hoffman et al., 2010; 2013). The goal was to speed up inference and to scale up LDA to very large data sets.


SVI optimizes the lower bound by stochastic optimization (Robbins & Monro, 1951). It maintains a set of local and global parameters, which characterize the variational posteriors. Local variables are the indicator variables $Z$ and the document-topic proportions $\Theta$, which are respectively characterized by the local parameters $\{\pi_{nd}\}_{n,d}$ and $\{\alpha_d\}_d$. The global variables are the topic-word proportions $\Phi$, which are characterized by the global parameters $\{\beta_k\}_k$. SVI considers a noisy, but unbiased estimate of the gradients of the variational parameters associated to the global variables.


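This leads to the following updates (document $d$ being picked at random) (Hoffman et al., 2013):

$$\beta_k^{(t)} = (1 - \rho_t)\, \beta_k^{(t-1)} + \rho_t \hat{\beta}_k,$$
$$\hat{\beta}_{vk} = \beta_0 + D \sum_{n} \delta_v(x_{nd})\, \pi_{knd}, \qquad (3)$$

where $\sum_t \rho_t = \infty$ and $\sum_t \rho_t^2 < \infty$. Throughout this work, we will use the learning rate $\rho_t = (t + \tau)^{-\kappa}$, where $\kappa \in (0.5, 1]$ and $\tau \geq 0$.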
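The SVI update (3) and the learning rate schedule can be sketched as follows. The function names and the nested-list layout `beta[v][k]` are assumptions of this example; the local proportions `pi` are taken as given (computed as in MVI).

```python
def learning_rate(t, tau=1.0, kappa=0.9):
    """Step size rho_t = (t + tau) ** (-kappa), with kappa in (0.5, 1]."""
    return (t + tau) ** (-kappa)

def svi_global_step(beta, doc, pi, num_docs, beta0, rho):
    """One SVI update (3): blend beta with the single-document estimate."""
    num_topics = len(beta[0])
    # Noisy estimate: beta0 + D * sum_n delta_v(x_nd) * pi_knd.
    beta_hat = [[beta0] * num_topics for _ in beta]
    for v, p in zip(doc, pi):
        for k in range(num_topics):
            beta_hat[v][k] += num_docs * p[k]
    # Convex combination of the previous value and the noisy estimate.
    return [[(1.0 - rho) * b + rho * bh for b, bh in zip(row, row_hat)]
            for row, row_hat in zip(beta, beta_hat)]
```

Because a single document stands in for the whole corpus, its counts are scaled by the number of documents $D$; this is what makes the estimate unbiased but noisy.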

Intuitively, the second term on the right hand side of (3) is a noisy, but unbiased estimate of the expected number of counts appearing in (2), namely $\langle m_{vk} \rangle$. The variational parameters associated to the local variables (that is, $\pi_{nd}$ and $\alpha_d$) can be computed as in MVI. Typically, mini-batches are used to stabilize the gradients. An interesting property of SVI is that it corresponds to natural gradients with respect to the variational distribution (Hoffman et al., 2010).


The intrinsic noise of the stochastic gradients can impede the convergence of SVI. Variance reduction techniques have been proposed to address this issue (Wang et al., 2013; Paisley et al., 2012). SVI is also sensitive to the learning rate decay schedule and the choice of mini-batch size (Ranganath et al., 2013). Next, we derive incremental variational inference for LDA, which does not require choosing and adjusting a learning rate. Importantly, it ensures a monotonic increase of the bound and, like MVI, convergence to a local maximum of the variational bound.


3. Incremental Variational Inference for LDA 


Incremental variational inference (IVI) computes updates in a similar fashion as incremental EM (Neal & Hinton, 1998). Each iteration performs a partial variational E-step before performing a variational M-step. This amounts to maintaining a set of global statistics associated to the global variables, which are updated incrementally in the variational E-step by first subtracting the old statistics associated to a data point (or a mini-batch) and adding back the corresponding new ones. The updated global statistics are then used in the variational M-step. This is to be contrasted with SVI. Indeed, SVI uses a noisy estimate of the global statistics, which is based exclusively on the mini-batch that is considered in the current iteration. In the case of LDA, IVI leads to the following incremental update:

$$\beta_{vk} = \beta_0 + \langle m_{vk} \rangle + \sum_{n=1}^{N_d} \delta_v(x_{nd}) \left( \pi_{knd}^{\mathrm{new}} - \pi_{knd}^{\mathrm{old}} \right), \qquad (4)$$

while the updates for $\pi_{nd}$ and $\alpha_d$ are the same as in MVI as they are associated to the local variables. The main advantage of IVI is that it ensures a monotonic increase of the bound and does not need to have seen all the data points to make progress. The price we have to pay is that we have to store the previous set of proportions $\pi_{nd}$, which can be costly when the number of topics $K$ is large as the additional memory requirements scale as a constant factor times the number of words in the corpus. IVI for LDA is summarized in Algorithm 1.

Algorithm 1 Incremental Variational Inference (IVI)

1: Initialize $\beta_k^{(0)}$ randomly; set $\alpha_{kd} = \alpha_0$.
2: for $t = 1, 2, \ldots$ do
3:   Sample a document $d$ uniformly
4:   repeat
5:     $\pi_{knd} \propto \exp\{\langle \ln \theta_{kd} \rangle + \langle \ln \phi_{x_{nd} k} \rangle\}$
6:     $\alpha_{kd} = \alpha_0 + \sum_{n=1}^{N_d} \pi_{knd}$
7:   until $\alpha_{kd}$ and $\pi_{knd}$ converge.
8:   $\beta_{vk} = \beta_0 + \langle m_{vk} \rangle + \sum_{n=1}^{N_d} \delta_v(x_{nd}) (\pi_{knd}^{\mathrm{new}} - \pi_{knd}^{\mathrm{old}})$
9: end for

Subsequently, we will also consider a stochastic variant of the IVI algorithm (S-IVI), which is closely related to stochastic average gradient (SAG) descent (Le Roux et al., 2012), which maintains a running average of the gradient. SAG has the low iteration cost of stochastic gradient descent and the linear convergence rate of batch gradient descent. S-IVI requires setting a learning rate, but it is amenable to the distributed variant discussed in the next section. It does not maintain strictly accurate sufficient statistics; rather, it uses statistics computed as a decaying average over recently visited data points. The resulting update is given by

$$\beta_k^{(t)} = (1 - \rho_t)\, \beta_k^{(t-1)} + \rho_t \hat{\beta}_k,$$
$$\hat{\beta}_{vk} = \beta_0 + \langle m_{vk} \rangle + \sum_{n=1}^{N_d} \delta_v(x_{nd}) \left( \pi_{knd}^{\mathrm{new}} - \pi_{knd}^{\mathrm{old}} \right), \qquad (5)$$

where $\rho_t = (t + \tau)^{-\kappa}$ as in SVI.
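The bookkeeping behind the incremental update (4) can be sketched with a small class that stores each document's previous proportions, subtracts them from the global expected counts and adds the new ones. The class and method names are hypothetical, and the proportions `new_pi` are assumed to come from the local E-step.

```python
class IncrementalStats:
    """Expected topic-word counts <m_vk>, maintained incrementally as in (4)."""

    def __init__(self, vocab_size, num_topics, beta0):
        self.beta0 = beta0
        self.counts = [[0.0] * num_topics for _ in range(vocab_size)]
        self.stored = {}  # document id -> (word ids, previous pi_nd)

    def update(self, doc_id, doc, new_pi):
        """Subtract the document's old contribution and add its new one."""
        if doc_id in self.stored:
            old_doc, old_pi = self.stored[doc_id]
            for v, p in zip(old_doc, old_pi):
                for k, value in enumerate(p):
                    self.counts[v][k] -= value
        for v, p in zip(doc, new_pi):
            for k, value in enumerate(p):
                self.counts[v][k] += value
        # Keeping a copy of pi_nd is the extra memory cost of IVI.
        self.stored[doc_id] = (list(doc), [list(p) for p in new_pi])

    def beta(self):
        """Global parameters beta_vk = beta_0 + <m_vk>."""
        return [[self.beta0 + c for c in row] for row in self.counts]
```

Once every document has been visited at least once, the stored counts equal the batch sums over the most recent proportions of each document, which is why no learning rate is needed.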

4. Distributed Variational Inference for LDA 

To speed up inference in the context of large data sets, SVI and IVI process documents sequentially. In this section, we further scale up IVI by extending it to the distributed setting. We introduce asynchronous distributed incremental variational inference (D-IVI), which infers topics comparable to those inferred by S-IVI, but with a significant reduction in computation time.

Algorithm 2 Distributed IVI (D-IVI)

1: Initialize $\beta_k^{(0)}$ randomly; set $\alpha_{kd} = \alpha_0$.
2: Set the step-size schedule $\rho_t$
3: Split documents into $P$ disjoint subsets $\{D_1, \ldots, D_P\}$
4: for $t = 1, 2, \ldots, \infty$ do
5:   for each processor $p \in \{1, \ldots, P\}$ in parallel do
6:     Sample a document $d$ uniformly from $D_p$
7:     repeat
8:       $\pi_{knd} \propto \exp\{\langle \ln \theta_{kd} \rangle + \langle \ln \phi_{x_{nd} k} \rangle\}$
9:       $\alpha_{kd} = \alpha_0 + \sum_{n=1}^{N_d} \pi_{knd}$
10:     until $\alpha_{kd}$ and $\pi_{knd}$ converge.
11:   end for
12:   $\hat{\beta}_{vk} = \beta_0 + \langle m_{vk} \rangle + \sum_{n=1}^{N_d} \delta_v(x_{nd}) (\pi_{knd}^{\mathrm{new}} - \pi_{knd}^{\mathrm{old}})$
13:   $\beta_k^{(t)} = (1 - \rho_t)\, \beta_k^{(t-1)} + \rho_t \hat{\beta}_k$
14: end for

Distributed inference algorithms handle multiple mini-batches in parallel to leverage distributed infrastructures. The key advantage of an asynchronous algorithm over a synchronous one is that it does not require a global synchronization step. As a result, it is not limited by the speed of the slowest processor (or worker). Moreover, the algorithm needs to be fault-tolerant, meaning that it needs to be robust to delays and possibly inaccurate updates.

The S-IVI updates in Section 3 are amenable to a distributed implementation with one master and $P$ workers, each of which holds $1/P$ of the documents in the corpus. The workers hold the local parameters $\{\pi_{nd}\}_{n,d}$ and $\{\alpha_d\}_d$. They independently carry out a variational E-step based on their possibly outdated copies of the global parameters $\{\beta_k\}_k$. Once they are done, they send the corrected statistics associated to the mini-batch to the master, that is, $\sum_n \delta_v(x_{nd}) (\pi_{knd}^{\mathrm{new}} - \pi_{knd}^{\mathrm{old}})$. The master updates the global parameters according to (5) and sends back the updated value to the worker. In practice, there is a trade-off between the convergence speed and the amount of communication. Smaller mini-batches speed up convergence of the algorithm, but increase the communication overhead. Algorithm 2 summarizes D-IVI.
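The master side of this protocol can be sketched as below. This is a simplified, single-process illustration and not the authors' implementation: the class name `Master` is hypothetical, message passing is replaced by a plain method call, and the expected counts $\langle m_{vk} \rangle$ in (5) are read off the current global parameter as $\beta_{vk} - \beta_0$, which is an assumption of this sketch.

```python
class Master:
    """Holds the global parameters and applies worker corrections via (5)."""

    def __init__(self, beta, beta0, tau=1.0, kappa=0.9):
        self.beta = [list(row) for row in beta]
        self.beta0 = beta0
        self.tau = tau
        self.kappa = kappa
        self.t = 0  # number of corrections applied so far

    def apply(self, correction):
        """Apply one message; correction[v][k] = sum_n delta_v(x_nd)(pi_new - pi_old)."""
        self.t += 1
        rho = (self.t + self.tau) ** (-self.kappa)
        for v, row in enumerate(correction):
            for k, delta in enumerate(row):
                # Assumption: <m_vk> is taken from the current global parameter.
                m_vk = self.beta[v][k] - self.beta0
                beta_hat = self.beta0 + m_vk + delta
                self.beta[v][k] = (1.0 - rho) * self.beta[v][k] + rho * beta_hat
        # Snapshot returned to the worker that sent the correction.
        return [list(row) for row in self.beta]
```

Corrections are applied in whatever order they arrive, so a worker may have computed its message from an outdated snapshot; the decaying step size limits the influence of any single stale message.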

We conclude this section by noting that the SVI updates cannot be applied in the asynchronous distributed setting. Even in the case of only two processors, we encountered numerical issues. Even when taking small step sizes, we were not able to ensure the convergence of the algorithm due to the stale global parameters.


5. Related Work 


Collapsed variational inference for LDA (Teh et al., 2006) is the de facto standard for learning topic models on corpora of moderate size. Recently, SVI was introduced to scale up inference and make it possible to handle massive corpora (Hoffman et al., 2010; 2013). Foulds et al. (2013) take this work one step further by developing stochastic collapsed variational inference. Modifications of SVI, such as subsampling from the data non-uniformly (Gopalan et al., 2013) or using control variates (Wang et al., 2013), have been proposed to reduce the variance of the noisy gradient and further speed up convergence. Along similar lines, Paisley et al. (2012) develop an algorithm that allows for direct optimization of the variational lower bound with variance reduction in the stochastic gradient, and Mandt & Blei (2014) propose a variance reduction scheme tailored to SVI by averaging successively over the sufficient statistics of the local variational parameters. All these methods, however, require tuning at least the learning rate and the mini-batch size. By contrast, IVI uses an incremental method to reduce the variance of the noisy natural gradient and has no learning rate. Our work is most closely related to the work by Hughes & Sudderth (2013). They generalize previous incremental variants of the EM algorithm and develop the memoized online variational inference algorithm, which is analogous to IVI, but they do not consider the stochastic and distributed extensions of IVI.

Various implementations and improvements have been explored for developing distributed algorithms for LDA to improve scalability in terms of memory and computation. Most works consider parallel algorithms that are synchronous. Moreover, these studies parallelize batch variational inference. For example, Nallapati et al. (2007) describe distributed mean-field variational EM for LDA. As in the case of D-IVI, it relies on the fact that the expensive variational E-step can easily be parallelized because the local variables are conditionally independent. However, the master node waits until each of the workers completes its job before performing the M-step. Wolfe et al. (2008) investigate the parallelization of both the E- and M-step of variational EM for LDA. Each node computes partial statistics in a local E-step, sends these to a central node, and receives back completed statistics relevant for completing its local M-step. This distributed version of LDA produces identical results to the sequential version of the algorithm, but it requires a global synchronization step. Zhai et al. (2012) proposed a distributed variational inference algorithm using the MapReduce framework, where the E-step is done in the Mappers and the M-step in the Reducer. Another set of works attempts to distribute MCMC algorithms (Smola & Narayanamurthy, 2010; Newman et al., 2009; Nallapati et al., 2007; Thiesson et al., 2001; Wolfe et al., 2008), where workers concurrently run several Gibbs samplers and perform a global update of the topic counts after synchronization. To our knowledge, only Asuncion et al. (2009) propose an asynchronous approach for LDA, which, unlike D-IVI, is based on Gibbs sampling.


6. Experiments and Results 

We carry out two types of experiments. First, we study the performance of IVI and S-IVI for LDA. We also benchmark IVI against MVI and SVI on large document collections. Second, we measure speed-ups that are obtained with our distributed algorithm (D-IVI).

Hardware: Experiments were all run on a 32-core machine with 3.6 GHz Intel Core i7-3820 processors and a total of 128GB of RAM.


Data: We benchmark IVI on four corpora: Associated Press articles, Newsgroup documents, Wikipedia articles and the scientific abstracts from the Arxiv repository (Mandt & Blei, 2014). In addition, we used two large corpora to evaluate D-IVI: reviews from the Amazon website and New York Times articles (Mandt & Blei, 2014). The characteristics of the datasets are reported in Table 1.


Experimental Setup: To quantitatively evaluate the model, we estimate the predictive probability over the vocabulary (Blei et al., 2003). We wish to achieve high average per-word likelihood on held-out test documents. Under this metric, a higher score is better, as a better model will assign a higher probability to the held-out words. We learn the topics on the training corpus. We use half of each test document to estimate its topic proportions and use the remainder to compute the predictive distribution over the vocabulary. In all the experiments, we set the number of topics $K$ to 100, the Dirichlet hyperparameters $\alpha_0$ to 0.5 and $\beta_0$ to 0.05. For stochastic methods, we set the forgetting constant $\kappa$ to 0.9 and the delay $\tau$ to 1.
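The held-out metric can be sketched as follows for a single test document. The use of posterior means of the topic proportions and of the topics to form the predictive distribution is an assumption of this sketch, as are the function and argument names; `alpha_d` is taken to have been estimated from the observed half of the document.

```python
import math

def per_word_log_predictive(alpha_d, beta, held_out_words):
    """Average log predictive probability of the held-out half of a document."""
    num_topics = len(alpha_d)
    alpha_sum = sum(alpha_d)
    theta_mean = [a / alpha_sum for a in alpha_d]  # E[theta_kd] under q
    col_sums = [sum(row[k] for row in beta) for k in range(num_topics)]
    total = 0.0
    for w in held_out_words:
        # p(w | observed half) ~ sum_k E[theta_kd] * E[phi_wk]
        p_w = sum(theta_mean[k] * beta[w][k] / col_sums[k]
                  for k in range(num_topics))
        total += math.log(p_w)
    return total / len(held_out_words)
```

Averaging this quantity over all test documents gives a per-word score on the same scale as the values reported in the experiments, where higher is better.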


6.1. IVI Prediction Results 

In the first set of experiments, we compare the different inference algorithms for LDA, using our own implementation of MVI, SVI and IVI. Figure 1 shows that IVI converges to a solution which is comparable to or better than those of MVI, SVI and S-IVI, and that IVI converges faster than the other algorithms. We first compare the performance of IVI and MVI at the point where MVI converges to a solution. IVI yields the same result after processing between half (Newsgroup) and a tenth (Arxiv) of the documents that MVI has processed. In addition, we observe that IVI gives consistently better predictive performance than MVI when both of them converge to a solution.


In Section 3 we mentioned that S-IVI does not maintain strictly accurate sufficient statistics, but uses statistics computed as a decaying average over recently visited data. Hence, it requires less memory than IVI and improves on SVI in terms of accuracy and speed. Figure 1 provides the experimental support for these claims.

In the second set of experiments, we evaluate IVI with various mini-batch sizes by computing the average predictive log likelihood on the test set.


Figure 2. Per-word predictive probability for LDA as a function of the number of documents. Each panel compares different values of the mini-batch size on the Associated Press, Newsgroup, Wikipedia and Arxiv data sets. IVI on the full data converges faster when a smaller batch size is used.


Next, we turn our attention to Figure 2. Fixing the hyperparameters and the number of topics, we explored the effect of various mini-batch sizes on all four corpora. IVI converges faster to a good solution with smaller mini-batches. However, larger mini-batches lead to better final performance.

6.2. D-IVI Convergence and Speed-up Results 

The purpose of these experiments is to measure speed-ups obtained with D-IVI. We report the performance of D-IVI on a single processor, which corresponds to S-IVI, for reference, and compare it to the performance of D-IVI for a varying number of processors. We are interested in two aspects of performance: the quality of the model learned and the time taken to learn the model. We record wall clock time and the log predictive probability on the Customer Review, New York Times and Arxiv corpora. In the experiments, computations were done on $P$ processors for D-IVI, where $P \in \{1, 2, 4, 8, 16, 32\}$. These results are averaged over 5 runs with random initializations. The results in Table 2 and Figure 3 show that the log predictive probability is essentially the same for the distributed models as for their single-processor versions at $P = 1$. Errors due to the stale parameters caused slight variations in the performance of D-IVI. This variation increases with the number of processors. This is shown in Figure 3, where we report the box and whiskers plot of the log predictive probability.

Table 1. Characteristics of data sets used in experiments.

                                         AP  Newsgroup  Wikipedia   Arxiv  Customer Review     NYT
Number of documents in training set    1246      13888      39565  782385           452944  290000
Number of documents in test set        1000       5000      10000  100000           100000   10000
Average number of words per document    198        249        260     116              151     232
Number of words in vocabulary         10473      27059      42419  141927           120043  102660

Figure 1. Per-word predictive probability for LDA as a function of the number of processed documents. We compare results for the Associated Press, Newsgroup, Wikipedia and Arxiv data sets. Incremental approaches (IVI and S-IVI) converge to a higher value on all datasets. We report results for two mini-batch sizes.

Figure 4. Convergence results (per-word predictive probability for the LDA model as a function of the number of documents processed so far) for D-IVI on Arxiv, Customer Review and New York Times for a varying number of processors. As the number of processors increases, the rate of convergence slows down.

One of the main motivations for developing D-IVI is to reduce computation time while retaining performance. The speed-up results shown in Table 2 and Figure 3 (bottom-right) demonstrate that the improvement in convergence speed obtained by increasing the number of processors is mitigated by the communication overhead. When the number of processors is large, the data subset assigned to each processor gets smaller. In this case, each update is less informative, and more iterations are needed for convergence. To overcome the communication overhead, we have used larger mini-batch sizes. Hence, more information is collected in each global parameter update, and so the number of iterations required for convergence is reduced.

D-IVI increases inference speed. We observe a ~1.8 times speed-up for all three corpora when using P=2 processors, and ~7.8, ~8.6 and ~9.9 times speed-up respectively for the CR, NYT and Arxiv datasets when using P=32 processors. These results suggest that asynchronous D-IVI converges to solutions that exhibit performance close to that obtained with S-IVI.

Figure 3. Log predictive probability comparisons for S-IVI and D-IVI for different numbers of processors on Arxiv, Customer Review and NYT. Bottom right: Speed-up results of D-IVI for a varying number of processors with respect to a single processor for Arxiv. Higher speed-up is obtained with larger mini-batches due to the diminished communication overhead.

Table 2. Log predictive probability (LPP) and runtime (in seconds per iteration) of D-IVI for different mini-batch sizes and numbers of processors.

                                            Number of Processors
Dataset                Mini-batch         1      2      4      8     16     32
                       Size
Customer Review (CR)   1000       LPP   -7.25  -7.25  -7.25  -7.28  -7.28  -7.28
                                  Time  13626   8015   4367   3299   2428   2259
                       2000       LPP   -7.26  -7.26  -7.26  -7.26  -7.28  -7.28
                                  Time  13162   7607   4126   3082   2237   2113
                       5000       LPP   -7.21  -7.21  -7.24  -7.24  -7.24  -7.24
                                  Time  13043   7538   3875   2757   1883   1659
New York Times (NYT)   1000       LPP   -7.49  -7.49  -7.51  -7.51  -7.51  -7.51
                                  Time  12935   6916   3902   2879   1987   1728
                       2000       LPP   -7.49  -7.49  -7.51  -7.51  -7.51  -7.51
                                  Time  12906   6826   3716   2648   1956   1701
                       5000       LPP   -7.48  -7.48  -7.48  -7.48  -7.50  -7.50
                                  Time  12427   6510   3407   2360   1748   1428
Arxiv                  1000       LPP   -7.63  -7.63  -7.63  -7.65  -7.66  -7.66
                                  Time  16996   9601   5087   3776   2534   2185
                       2000       LPP   -7.55  -7.55  -7.56  -7.56  -7.56  -7.56
                                  Time  17845   9857   5110   3678   2453   2158
                       5000       LPP   -7.52  -7.52  -7.52  -7.54  -7.54  -7.54
                                  Time  17957  10030   4760   3228   2105   1835
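A speed-up is the ratio of the single-processor runtime to the runtime on $P$ processors. The short calculation below uses the runtimes listed for a mini-batch size of 5000 in Table 2; which rows the quoted speed-ups were derived from is not stated, so this choice is an assumption, and the resulting ratios are close to, but not identical with, the quoted figures.

```python
# Runtimes copied from the mini-batch size 5000 rows of Table 2,
# keyed by number of processors.
runtimes = {
    "CR": {1: 13043, 2: 7538, 32: 1659},
    "NYT": {1: 12427, 2: 6510, 32: 1428},
    "Arxiv": {1: 17957, 2: 10030, 32: 1835},
}

def speed_up(times, num_processors):
    """Ratio of the single-processor runtime to the P-processor runtime."""
    return times[1] / times[num_processors]

for name, times in runtimes.items():
    print(name, round(speed_up(times, 2), 2), round(speed_up(times, 32), 2))
```

With 32 processors the ratios stay well below 32, which is consistent with the communication overhead discussed in this section.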


Simulated Delays: Next, we add delays to some workers to explore the robustness of D-IVI. Figure 4 provides results when each processor sleeps with probability 0.5 for a small amount of time before sending the latest sufficient statistics correction to the master. The delay length is chosen randomly from a normal distribution with mean $\mu$ (in seconds) and $\sigma = \mu/5$. We have specified the upper limit of $\mu$ as twice the average time required to compute the sufficient statistics of a mini-batch.
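The delay model can be sketched in a few lines. The function name is hypothetical, and clipping negative draws at zero is an assumption of this sketch (the text does not say how negative samples are handled).

```python
import random

def sample_delay(mu, rng, sleep_probability=0.5):
    """Return a delay in seconds for one worker message.

    With probability 1 - sleep_probability the worker does not sleep;
    otherwise the delay is drawn from a normal with mean mu and sigma = mu / 5.
    """
    if rng.random() >= sleep_probability:
        return 0.0
    # Clipping at zero is an assumption; negative draws are 5 sigma away.
    return max(0.0, rng.gauss(mu, mu / 5.0))
```

A worker would call `time.sleep(sample_delay(mu, rng))` before sending its correction to the master; since $\sigma = \mu/5$, almost all non-zero delays fall between $0.4\mu$ and $1.6\mu$.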

Here, we report performance by plotting the log predictive probability against the number of documents seen so far. Figure 4 shows that as the number of processors increases, the rate of convergence slows down, since more iterations are needed for information to propagate to all the processors. However, it is important to note that, in wall clock time, one iteration of D-IVI is up to $P$ times faster than one iteration of S-IVI, where $P$ is the number of processors, so D-IVI converges much more quickly than S-IVI (see Table 2 for time results).


Figure 5. Convergence of D-IVI on Customer Review when a delay is encountered. The delay times are sampled from $\mathcal{N}(\mu, \sigma^2)$, for several values of $\mu$. As $\mu$ increases, the rate of convergence slows down. While curves with $\mu = 500$ and $1000$ appear less smooth than the others, they are still heading steadily toward convergence. As the number of processors increases, the rate of convergence slows down.


Finally, we test whether D-IVI is robust to extremely stale parameters by increasing the delay. Figure 5 shows the results for this case. For the CR corpus, the computation time of the sufficient statistics for a mini-batch of 1000 documents is 26 seconds on average. Here, each processor sleeps with probability 0.25 and the average delay is set to twice (50 seconds, $\mu$=200), 5 times and 10 times the computation time for a mini-batch.

We see that the D-IVI algorithm still converges even with considerable delays of 5 and 10 times the processing time for a mini-batch. Despite the lack of formal convergence guarantees, the D-IVI algorithm performs well empirically in all experiments we conducted on the three real-world data sets considered.
7. Conclusion

We introduced incremental variational inference as an alternative to stochastic variational inference. The algorithm does not require adjusting a learning rate. We showed experimentally that the incremental approach converges faster and often to a better local optimum of the variational objective. Incremental variational inference processes documents sequentially. It thus scales similarly to stochastic variational inference and is suitable when we can afford to incur an additional memory cost (which scales as $O(KN)$).

We further modified incremental variational inference to accommodate a stochastic variant, which can be adapted to distributed environments. This enabled us to further scale variational inference. We showed experimentally that the proposed asynchronous algorithm is robust to noise and outdated parameters, and produces solutions that are very close to the single-host solutions. The horizontal speed-up saturates when the number of processors increases, as the communication cost increases and more passes over the data are necessary to ensure convergence to the same level of accuracy.

We leave the convergence analysis of incremental variational inference to future work, as well as its application to other probabilistic models. Indeed, the incremental variational algorithms proposed in the paper are generic. They can be applied to any model with local and global variables and are by no means restricted to LDA.